Abstract

Data obtained from ISSR amplification are easy to extract, but they only tell us, for each gene, whether a specific allele is present or not. From this partial information we provide a probabilistic method to reconstruct the pedigree of families of diploid cultivars. The method consists in determining, for each individual, the most likely parent pair amongst all older individuals, according to some probability measure. The construction of this measure relies on the fact that the probability of observing the specific alleles in the child, given the status of the parents, does not depend on the generation and is the same for each gene. This assumption is justified by a convergence result for gene frequencies, which is proved here. Our reconstruction method is applied to a family of 85 living accessions representing the common broom Cytisus scoparius.

1 Introduction


A pedigree is a directed graph in which each vertex has indegree 0 or 2 and arbitrary outdegree. When it represents family relationships between living individuals, edges are directed from parents to children. By reconstruction of the pedigree of a set of individuals, we mean a way to determine the most likely pedigree relating these individuals given some information such as phenotype, genotype, date of birth, or data obtained from professional breeders. It may happen that this information is known only for part of the population, or even that the number of missing individuals is unknown. To each situation correspond specific methods. Deterministic methods based on the maximum parsimony principle and using purely combinatorial arguments allow us to reconstruct the minimal pedigree relating individuals in accordance with their types, see Chapter 4 in [?], [?] or [2]. There are also numerous stochastic methods of pedigree reconstruction, see for instance [?], [?], [?], [?]. In any case, the method consists in finding a suitable probabilistic framework in which we can identify the most likely pedigree relating some set of individuals. Some models focus on the reconstruction of the lineages by estimating transition probabilities between nodes. Reconstructing the pedigree then comes down to the construction of a Markov chain. This method is quite popular when making use of identity by descent (IBD) data [6]. In this case, statistical inference based on Markov chain Monte Carlo and Bayesian statistics is used to infer transition probabilities between nodes of the graph, [12] and [13]. Coalescence theory may also prove to be a powerful tool in the reconstruction of pedigrees, as observed in [13].

In the present work, we assume that the known information is of a genomic type and is provided through ISSR amplification for diploid plant cultivars, which are vegetatively propagated. ISSR amplification was popularised by [16] and is largely used in genetic diversity assessment [8]. Because the cultivars are vegetatively propagated, the available dataset contains both descendants and ancestors in the pedigree, thus both terminal and internal nodes of the graph, while most of the methods listed above use information from last-generation descendants (i.e. terminal nodes of the graph). We know the same genotypic information for each individual and we assume that there are no missing individuals in the set. ISSR data only allow us to know, for each gene, whether a specific allele is present or not. In particular, in the case of presence, we do not know whether this specific allele is present on both chromosomes (i.e. in the homozygous state, and transmitted to all descendants) or on only one of them (i.e. in the heterozygous state, and thus transmitted to only half of the descendants). It is as if we observed the phenotypic expression of a dominant gene, and our model can also be applied to this kind of situation (see the discussion at the end of this paper). From this partial information we provide a probabilistic method to reconstruct the pedigree of families of diploid plant cultivars. This method consists in determining, for each individual, the most likely parent pair amongst all older individuals, according to some probability measure. More specifically, if $g_1, \dots, g_n$ are individuals ranked in their birth order, then for each $i = 1, \dots, n$, we are looking for a pair of individuals, possibly non-distinct, in the set $\{g_1, \dots, g_{i-1}\}$ which is the most likely parent pair of $g_i$ according to some probability measure. The construction of this measure relies on the fact that the probability of observing the specific alleles in the child, given the status of the parents, does not depend on the generation. It only depends on the gene frequencies, which are supposed to be constant in time. In order to justify this assumption, we prove here that gene frequencies converge almost surely, as the number of crossbreedings increases, toward an equilibrium which satisfies the Hardy-Weinberg condition.

Our reconstruction method is applied to a family of 85 living accessions representing the common broom Cytisus scoparius and related cultivated hybrids (Cytisus x dallimorei, Cytisus x boskoopii). The latter are diploid, sexually reproducing plants whose crossbreedings have occurred over the past 200 years from a set of founders which is to be specified by our model. For each individual, 6 markers are used to highlight the presence or absence of a particular allele in a high number of distinct regions of the genome. These 6 markers provide a total of more than 420 distinct bands for these 85 accessions, and each band has been treated as present or absent for each individual. The results of our model applied to these particular data are described in Section 3. Section 2 is devoted to the presentation of the model as well as to the convergence result for gene frequencies which justifies its relevance. Then we give some conclusions in Section 4, comparing our results to the existing literature and highlighting some other frameworks where our method can be used.


2 Materials and Methods 

2.1 Model overview 

We represent a pedigree as a directed graph in which each vertex corresponds to an individual and each directed edge corresponds to a parent-child relationship, with the edge going from parent to child. The individuals are partitioned into two sets, $F$ and $F^c$, referred to as the founders and the non-founders respectively. The pedigree specifies, for every non-founder individual, two (not necessarily distinct) individuals which, according to a probabilistic model defined shortly, are the most likely parents.


We first define the law of reproduction in the population. Let $n$ be the number of individuals, denoted $g_1, \dots, g_n$, and let $m \in \mathbb{N}$ be the number of genes for which we observe the presence or absence of a specific allele. More specifically, when proceeding to the ISSR amplification, for each gene, we receive from some marker a binary response: either the allele is present on at least one of the two chromosomes or it is absent on both. In particular, when the allele is present, we do not know if it is present on the two chromosomes. Actually, it is equivalent to consider that the allele which is highlighted by the marker is dominant and that we only observe the phenotype of the individual. For each individual $g_i$ and each gene $\ell \in \{1, \dots, m\}$, let $x_\ell(g_i) \in \{0,1\}$ be the indicator of band absence (0-values) and presence (1-values) of individual $g_i$ obtained during the ISSR amplification process. Hence the apparent genotype of each individual $g$ will be identified with the element $x(g) := (x_1(g), x_2(g), \dots, x_m(g))$ of $\{0,1\}^m$. Note that the event $\{x_\ell(g) = 1\}$ means "one observes the presence of the allele specific to gene $\ell$ in individual $g$" or equivalently "the allelic combination of gene $\ell$ in individual $g$ is 01 or 11".


Each individual $g$ has an associated date of birth, denoted $t(g)$. We set $t(g) = 0$ if the individual $g$ was obtained from the wild, in which case it will be considered as a founder. Otherwise we set $t(g)$ equal to the date the individual was accessioned. We order the individuals so that for $i < j$, $t(g_i) < t(g_j)$, whenever $t(g_j) > 0$ (it is assumed that dates of birth are distinct from each other). The basic principles of our reconstruction method are:

(a) a uniform prior on the probability that $(g_j, g_k)$ are the parents of individual $g_i$, over all pairs $(g_j, g_k)$ with $\max(t(g_j), t(g_k)) < t(g_i)$;

(b) no missing individuals, that is, the parents of each non-founder individual $g_i$ belong to the set $\{g_1, \dots, g_n\} \setminus \{g_i\}$.

Let us denote by $\hat g$ and $\check g$ the parents of the individual $g$. When they breed, the two parents $\hat g$ and $\check g$ with respective apparent genotypes $x(\hat g)$ and $x(\check g)$ will give birth to the individual $g$ with apparent genotype $x(g)$ according to the following rules:

(c) independence of the coordinates of $x(g)$, that is, $\{x_\ell(g) = 1\}$ and $\{x_{\ell'}(g) = 1\}$ are independent for all $\ell' \neq \ell$;

(d) there are constants $\delta \in (-1/2, 1/2)$ and $\varepsilon \in (0, 1/2)$, called the errors, and for each $\ell$ there are constants $p_\ell \in (3/4, 1)$ and $q_\ell \in (1/2, 1)$ such that, for each individual $g$ and its parents $\hat g$ and $\check g$,

- $\mathbb{P}(\{x_\ell(g) = 1\} \mid \{x_\ell(\hat g) = 1\}, \{x_\ell(\check g) = 1\}) = \min(p_\ell - \delta, 1)$,

- $\mathbb{P}(\{x_\ell(g) = 1\} \mid \{x_\ell(\hat g) = 0\}, \{x_\ell(\check g) = 1\}) = \min(q_\ell - \delta, 1)$,

- $\mathbb{P}(\{x_\ell(g) = 1\} \mid \{x_\ell(\hat g) = 0\}, \{x_\ell(\check g) = 0\}) = \varepsilon$.

Principles (a) and (b) should rather be considered as the most natural assumptions in the absence of any particular constraint on the evolution of the population. Note that according to (a), the father and mother can be the same individual, which is standard in plant populations. Principle (c) means that different genes evolve independently of each other. In our specific example we will select a particular set of genes whose independence will be checked by means of a statistical test, see Section 3.

Let us now concentrate on principle (d). The constants $\delta$ and $\varepsilon$ are actually experimental errors, so they do not depend on the gene $\ell$. It appears that when the parents satisfy $\{x_\ell(\hat g) = 1\}, \{x_\ell(\check g) = 1\}$ (resp. $\{x_\ell(\hat g) = 0\}, \{x_\ell(\check g) = 1\}$), the probability of observing $\{x_\ell(g) = 1\}$ for the child is less than the theoretical probability $p_\ell$ (resp. $q_\ell$), that is $p_\ell - \delta$ (resp. $q_\ell - \delta$). Similarly, it can happen that when the parents satisfy $\{x_\ell(\hat g) = 0\}, \{x_\ell(\check g) = 0\}$ one observes $\{x_\ell(g) = 1\}$ for the child. This defines the error $\varepsilon$. As shown hereafter, we have $p_\ell \in (3/4, 1)$ and $q_\ell \in (1/2, 1)$, and the estimation from our data, see Section 3, shows that $\delta$ and $\varepsilon$ are actually of order 0.1.

Besides, we recall that although reproduction is sexual, since we are concerned with plant populations, each individual can act either as male or as female, so that when referring to the parents $g_j$ and $g_k$ of the individual $g_i$, the mother and the father are not distinguished. In particular we have $\mathbb{P}(\{x_\ell(g) = 1\} \mid \{x_\ell(\hat g) = 0\}, \{x_\ell(\check g) = 1\}) = \mathbb{P}(\{x_\ell(g) = 1\} \mid \{x_\ell(\hat g) = 1\}, \{x_\ell(\check g) = 0\})$.

We now focus on the computation of the conditional probabilities appearing in (d). In order to compute the theoretical values $p_\ell$ and $q_\ell$, let us assume that there is no experimental error, i.e. $\delta = \varepsilon = 0$, so that the expressions in (d) are $\mathbb{P}(\{x_\ell(g) = 1\} \mid \{x_\ell(\hat g) = 1\}, \{x_\ell(\check g) = 1\}) = p_\ell$ and $\mathbb{P}(\{x_\ell(g) = 1\} \mid \{x_\ell(\hat g) = 0\}, \{x_\ell(\check g) = 1\}) = q_\ell$. Let us now compute $p_\ell$ and $q_\ell$ in terms of the gene frequencies. We will prove in the next section that for each gene, the frequencies of the three genotypes 00, 01 and 11 converge toward some equilibrium as the number of crossbreedings increases. Let us denote these frequencies by $\pi_{00}(\ell)$, $\pi_{01}(\ell)$ and $\pi_{11}(\ell)$ respectively. Then in our model, we assume that this equilibrium is attained, so that:

(e) $\pi_{00}(\ell)$, $\pi_{01}(\ell)$ and $\pi_{11}(\ell)$ do not depend on time.

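Note that here, by time, we mean a scale which is incremented by successive crossbreedings. Assumption (e) will be justified in the next section. When no confusion is possible, we will drop the index $\ell$ in $\pi_{00}(\ell)$, $\pi_{01}(\ell)$ and $\pi_{11}(\ell)$. Let us compute $p_\ell$ and $q_\ell$ in terms of $\pi_{00}$, $\pi_{11}$ and $\pi_{01}$. For a pair of parents $(\hat g, \check g)$ chosen uniformly at random in the sub-population $\{g' : t(g') < t(g)\}$, the probability of observing $x_\ell(\hat g) = 1$ and $x_\ell(\check g) = 1$ is

\[ \mathbb{P}(\{x_\ell(\hat g) = 1\}, \{x_\ell(\check g) = 1\}) = \pi_{11}^2 + 2\pi_{01}\pi_{11} + \pi_{01}^2 . \]

When they breed and give a child $g$, the probability of observing $x_\ell(\hat g) = 1$, $x_\ell(\check g) = 1$ and $x_\ell(g) = 1$ is

\[ \mathbb{P}(\{x_\ell(\hat g) = 1\}, \{x_\ell(\check g) = 1\}, \{x_\ell(g) = 1\}) = \pi_{11}^2 + 2\pi_{01}\pi_{11} + \tfrac{3}{4}\pi_{01}^2 . \]

We obtain that at any time, $p_\ell$ is given by

\[ p_\ell = \frac{\pi_{11}^2 + 2\pi_{01}\pi_{11} + 3\pi_{01}^2/4}{\pi_{11}^2 + 2\pi_{01}\pi_{11} + \pi_{01}^2} = 1 - \frac{\pi_{01}^2}{4(\pi_{01} + \pi_{11})^2} . \]

Then $q_\ell$ is obtained in the same way:

\[ q_\ell = \frac{\pi_{01} + 2\pi_{11}}{2(\pi_{01} + \pi_{11})} . \]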





Since the frequencies $\pi_{00}$, $\pi_{01}$ and $\pi_{11}$ belong to $(0,1)$, it is easy to check from the above expressions that $p_\ell \in (3/4, 1)$ and $q_\ell \in (1/2, 1)$. Furthermore, we have the relationship $p_\ell = q_\ell(2 - q_\ell)$. In Theorem 1 we show that in fact the triplet of gene frequencies $(\pi_{00}, \pi_{01}, \pi_{11})$ satisfies the Hardy-Weinberg equilibrium, that is $\pi_{01} = 2\sqrt{\pi_{00}\pi_{11}}$, and using this relation, we deduce that

\[ q_\ell = \frac{1}{1 + \sqrt{\pi_{00}}}, \qquad p_\ell = \frac{1 + 2\sqrt{\pi_{00}}}{(1 + \sqrt{\pi_{00}})^2} . \tag{2.1} \]
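As a quick numerical illustration of (2.1), the following Python sketch computes $p_\ell$ and $q_\ell$ both from the general expressions in terms of $\pi_{01}$ and $\pi_{11}$ and from the Hardy-Weinberg form in terms of $\pi_{00}$ alone. The function names are hypothetical and chosen only for this illustration; they are not taken from the authors' R program.

```python
import math


def pq_from_genotype_frequencies(pi01: float, pi11: float) -> tuple[float, float]:
    """General expressions for (p, q) in terms of pi01 and pi11 (no errors)."""
    carriers = pi01 + pi11  # probability that a parent shows the band
    p = 1.0 - pi01 ** 2 / (4.0 * carriers ** 2)
    q = (pi01 + 2.0 * pi11) / (2.0 * carriers)
    return p, q


def pq_from_pi00(pi00: float) -> tuple[float, float]:
    """Hardy-Weinberg form (2.1): (p, q) as functions of pi00 only."""
    s = math.sqrt(pi00)
    p = (1.0 + 2.0 * s) / (1.0 + s) ** 2
    q = 1.0 / (1.0 + s)
    return p, q


# Example: pi00 = 0.25 gives, under Hardy-Weinberg, pi11 = 0.25 and pi01 = 0.5.
p_hw, q_hw = pq_from_pi00(0.25)
p_gen, q_gen = pq_from_genotype_frequencies(0.5, 0.25)
```

Both routes give the same pair $(p_\ell, q_\ell)$ when the frequencies satisfy $\pi_{01} = 2\sqrt{\pi_{00}\pi_{11}}$, and the identity $p_\ell = q_\ell(2 - q_\ell)$ can be checked on the output.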


We shall now define the set of probability measures $\mu_i$ from which the most likely pedigree will be derived. This definition is based on the conditional probabilities

\[ \mathbb{P}(x(g) = a \mid x(\hat g) = \hat a, x(\check g) = \check a) = \prod_{\ell=1}^{m} \mathbb{P}(x_\ell(g) = a_\ell \mid x_\ell(\hat g) = \hat a_\ell, x_\ell(\check g) = \check a_\ell), \]

which are obtained from all acceptable triplets of individuals $(\hat g, \check g, g)$ and their apparent genotypes $a = (a_1, \dots, a_m)$, $\hat a = (\hat a_1, \dots, \hat a_m)$ and $\check a = (\check a_1, \dots, \check a_m)$ in $\{0,1\}^m$. More specifically, the set of individuals $\{g_1, \dots, g_n\}$ and their apparent genotypes being given, for all triples $(i,j,k) \in \{1, \dots, n\}^3$ and for each gene $\ell$, we first define the agreement/disagreement indicators between the genotype of an individual $g_i$ and that of the possible pair of parents $(g_j, g_k)$:

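\[ P^{(\ell)}_{ijk} = 1_{\{x_\ell(g_j) = x_\ell(g_k) = x_\ell(g_i) = 1\}}, \qquad \bar P^{(\ell)}_{ijk} = 1_{\{x_\ell(g_j) = x_\ell(g_k) = 1,\; x_\ell(g_i) = 0\}}, \]

\[ Q^{(\ell)}_{ijk} = 1_{\{x_\ell(g_j) \neq x_\ell(g_k),\; x_\ell(g_i) = 1\}}, \qquad \bar Q^{(\ell)}_{ijk} = 1_{\{x_\ell(g_j) \neq x_\ell(g_k),\; x_\ell(g_i) = 0\}}, \]

\[ E_{ijk} = \sum_{\ell=1}^{m} 1_{\{x_\ell(g_j) = x_\ell(g_k) = 0,\; x_\ell(g_i) = 1\}}, \qquad \bar E_{ijk} = \sum_{\ell=1}^{m} 1_{\{x_\ell(g_j) = x_\ell(g_k) = 0,\; x_\ell(g_i) = 0\}} . \]

Now define $p_{\delta,\ell} = \min(p_\ell - \delta, 1)$, $q_{\delta,\ell} = \min(q_\ell - \delta, 1)$, $\bar p_{\delta,\ell} = 1 - p_{\delta,\ell}$, $\bar q_{\delta,\ell} = 1 - q_{\delta,\ell}$, $\bar\varepsilon = 1 - \varepsilon$ and

\[ \tilde\mu_i(j,k) = \begin{cases} \varepsilon^{E_{ijk}} \, \bar\varepsilon^{\bar E_{ijk}} \prod_{\ell=1}^{m} p_{\delta,\ell}^{P^{(\ell)}_{ijk}} \, \bar p_{\delta,\ell}^{\bar P^{(\ell)}_{ijk}} \, q_{\delta,\ell}^{Q^{(\ell)}_{ijk}} \, \bar q_{\delta,\ell}^{\bar Q^{(\ell)}_{ijk}}, & \text{if } j \le k < i, \\ 0, & \text{otherwise.} \end{cases} \]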


Then for each $i = 2, \dots, n$, the probability measure $\mu_i$ on $\{1, \dots, n\}^2$ is explicitly defined in terms of $x$ by

\[ \mu_i(j,k) = \frac{\tilde\mu_i(j,k)}{Z_i}, \qquad j, k \in \{1, \dots, n\}, \]

where $Z_i := \sum_{j,k} \tilde\mu_i(j,k)$ is a normalising constant. We readily check that $Z_i > 0$ for all $i$ such that $t(g_i) > 0$. Moreover, individuals $g_i$ such that $t(g_i) = 0$ are necessarily founders (i.e. $g_i \in F$), hence their parents do not belong to the current pedigree, so in this case we set

\[ \mu_i(j,k) = 0, \qquad j, k \in \{1, \dots, n\}. \]


Fix a threshold probability $p \in (0,1)$. Then an individual $g_i$ is in the set of non-founder individuals only if there exists a pair $(j,k) \in \{1, \dots, n\}^2$ such that $\mu_i(j,k) > p$ with $j \le k < i$ (it follows that the partitioning depends on the value of $p$).


For each individual $g_i \in F^c$, we wish to determine $g_j$ and $g_k$ (possibly equal) such that the following two conditions are satisfied:

1. $j \le k < i$ ($g_j$ and $g_k$ accessioned before $g_i$);

2. $\mu_i(j,k) = \max_{j',k'} \{\mu_i(j',k') : j' \le k' < i\}$ ($g_j$ and $g_k$ maximize the likelihood).

We remark that, by definition of $F^c$, if we have found such a pair $g_j$ and $g_k$, then $\mu_i(j,k) > p$.

Note also that the normalization of the probability measure $\mu_i$ is relevant only for the comparison with the threshold probability. Steps 1 and 2 define the algorithm that we implemented in R to produce the reconstructions of pedigrees, see Section 3.
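Steps 1 and 2 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' R program: individuals are indexed from 0 in accession order, `x[i]` is the 0/1 band vector of individual `i`, `p` and `q` are per-band lists, and the names `log_weight` and `most_likely_parents` are hypothetical. Wild-collected individuals with $t(g_i) = 0$ are assumed to be filtered out by the caller. Logarithms are used to avoid numerical underflow when the number of bands is large.

```python
import math
from itertools import combinations_with_replacement


def log_weight(child, parent_a, parent_b, p, q, delta, eps):
    """Log of the unnormalised weight of (parent_a, parent_b) as parents of child."""
    total = 0.0
    for band, observed in enumerate(child):
        present_in_parents = parent_a[band] + parent_b[band]
        if present_in_parents == 2:
            prob_present = min(p[band] - delta, 1.0)
        elif present_in_parents == 1:
            prob_present = min(q[band] - delta, 1.0)
        else:
            prob_present = eps
        prob = prob_present if observed == 1 else 1.0 - prob_present
        if prob <= 0.0:
            return -math.inf  # configuration impossible under the model
        total += math.log(prob)
    return total


def most_likely_parents(x, i, p, q, delta, eps, threshold):
    """Return (j, k, mu) for the best pair j <= k < i, or None if i is a founder."""
    pairs = list(combinations_with_replacement(range(i), 2))  # all j <= k < i
    if not pairs:
        return None
    logs = [log_weight(x[i], x[j], x[k], p, q, delta, eps) for j, k in pairs]
    top = max(logs)
    if top == -math.inf:
        return None
    weights = [math.exp(value - top) for value in logs]  # rescaled, same ratios
    best = max(range(len(pairs)), key=weights.__getitem__)
    mu = weights[best] / sum(weights)  # normalised probability of the best pair
    if mu > threshold:
        j, k = pairs[best]
        return j, k, mu
    return None


# Toy data: three individuals, three bands.
x = [[1, 1, 0], [0, 0, 1], [1, 1, 1]]
result = most_likely_parents(x, 2, [0.9] * 3, [0.7] * 3, 0.15, 0.05, 0.3)
```

In this toy example the pair formed by the first two individuals is selected as the parents of the third, since it is the only pair that explains all three bands without invoking the error $\varepsilon$.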


2.2 Convergence to equilibrium 


In this subsection, we are interested in the dynamics of the frequencies of each genotype in the population. As already mentioned in the previous section, our reconstruction method relies strongly on the assumption that the frequencies $\pi_{00}$, $\pi_{01}$ and $\pi_{11}$ of the types 00, 01 and 11 do not depend on time, that is condition (e) in Subsection 2.1. We will show in the present subsection that as the number of crossbreedings increases, these frequencies converge almost surely to some random equilibrium. This result actually justifies assumption (e).


From time $n = 0$, we rank the crossbreedings in increasing order as they occur. Since the evolutions of genes are independent of each other, see assumption (c), we only need to consider the dynamics of the frequencies of genotypes 00, 01, 11 for one gene. Then let us denote by $\pi^n_{00}$, $\pi^n_{01}$ and $\pi^n_{11}$ the proportions of individuals $g$ with genotype 00, 01 or 11 respectively, after the $n$-th crossbreeding. Let us assume that we start at time $n = 0$ with two founders, so that after the $n$-th crossbreeding, $n + 2$ individuals are present in the population. This assumes in particular that there is no death. Moreover we assume that both alleles exist in the two founders. Then our reproduction law described in (a)-(d) of the previous subsection may actually be represented as a generalized urn model in which the probability of replacement depends on the proportion of individuals in the population, see [?] and the references therein. More specifically, at each step $n$ (crossbreeding), condition (a) tells us that we choose two individuals uniformly at random in the population.

Let us define the polynomial function $F : \{(x,y,z) \in [0,1]^3 : x + y + z = 1\} \to \mathbb{R}^3$ by

\[ F(x,y,z) + (x,y,z) = \left(xy + x^2 + y^2/4,\; xy + yz + 2xz + y^2/2,\; yz + z^2 + y^2/4\right), \]

and denote by $S = \{(x,y,z) \in [0,1]^3 : F(x,y,z) = 0\}$ the zero set of $F$.

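We construct $\pi^n = (\pi^n_{00}, \pi^n_{01}, \pi^n_{11})$ recursively. Write $F = (F_1, F_2, F_3)$. At each step $n$, two uniformly chosen individuals from the population breed and the new frequencies of individuals with types 00, 01 and 11 become:

\[ \pi^{n+1} = \begin{cases}
\left( \dfrac{(n+2)\pi^n_{00} + 1}{n+3},\; \dfrac{(n+2)\pi^n_{01}}{n+3},\; \dfrac{(n+2)\pi^n_{11}}{n+3} \right) & \text{with probability } \pi^n_{00}\pi^n_{01} + (\pi^n_{00})^2 + (\pi^n_{01})^2/4 = F_1(\pi^n) + \pi^n_{00}, \\[2ex]
\left( \dfrac{(n+2)\pi^n_{00}}{n+3},\; \dfrac{(n+2)\pi^n_{01} + 1}{n+3},\; \dfrac{(n+2)\pi^n_{11}}{n+3} \right) & \text{with probability } \pi^n_{00}\pi^n_{01} + \pi^n_{01}\pi^n_{11} + 2\pi^n_{00}\pi^n_{11} + (\pi^n_{01})^2/2 = F_2(\pi^n) + \pi^n_{01}, \\[2ex]
\left( \dfrac{(n+2)\pi^n_{00}}{n+3},\; \dfrac{(n+2)\pi^n_{01}}{n+3},\; \dfrac{(n+2)\pi^n_{11} + 1}{n+3} \right) & \text{with probability } \pi^n_{01}\pi^n_{11} + (\pi^n_{11})^2 + (\pi^n_{01})^2/4 = F_3(\pi^n) + \pi^n_{11}.
\end{cases} \]

Let us make this construction more formal. First we define a stochastic process $(\delta_n)_n$ with values in $\{(1,0,0), (0,1,0), (0,0,1)\}$ in such a way that the law of $\delta_{n+1}$ conditionally on $\pi^0 = i_0, \dots, \pi^n = i_n$ is $F(i_n) + i_n$. Recall that $n + 2$ is the population size at time $n$, so that $(n+2)\pi^n$ is the vector of genotype counts at time $n$. Then $\pi^{n+1}$ is defined by

\[ (n+3)\pi^{n+1} = (n+2)\pi^n + \delta_{n+1}, \qquad n \ge 0 . \]

Let us set

\[ \eta_n = \delta_{n+1} - F(\pi^n) - \pi^n, \]

then we readily obtain the following equality

\[ \pi^{n+1} = \pi^n + \frac{1}{n+3}\left(F(\pi^n) + \eta_n\right). \tag{2.2} \]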
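The urn dynamics described above can be simulated directly. The following Python sketch is an illustration under stated assumptions: the two founders are taken to be heterozygous (genotype 01), parents are drawn uniformly with replacement so that selfing is allowed, and the names `offspring_law` and `simulate_frequencies` are hypothetical. The offspring law is written through the frequency $a = x + y/2$ of allele 0, which gives the same three polynomials as $F(x,y,z) + (x,y,z)$.

```python
import math
import random


def offspring_law(x: float, y: float, z: float) -> tuple[float, float, float]:
    """Probabilities that the next child has genotype 00, 01, 11."""
    a = x + y / 2.0  # frequency of allele 0 among the parents
    return a * a, 2.0 * a * (1.0 - a), (1.0 - a) ** 2


def simulate_frequencies(n_steps: int, counts=(0, 2, 0), seed: int = 0):
    """Run n_steps crossbreedings and return the final genotype frequencies."""
    rng = random.Random(seed)
    counts = list(counts)  # numbers of individuals of type 00, 01, 11
    for _ in range(n_steps):
        total = sum(counts)
        probs = offspring_law(*(c / total for c in counts))
        child = rng.choices(range(3), weights=probs)[0]
        counts[child] += 1  # no death: the population grows by one
    total = sum(counts)
    return tuple(c / total for c in counts)


freqs = simulate_frequencies(20000, seed=1)
# Deviation from the Hardy-Weinberg relation pi01 = 2 * sqrt(pi00 * pi11).
hw_gap = abs(freqs[1] - 2.0 * math.sqrt(freqs[0] * freqs[2]))
```

Running the simulation with different seeds gives different limiting frequencies, but in each run the gap `hw_gap` becomes small as the number of crossbreedings grows.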

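For $u \in [0,1]^3$, let $f_u : \mathbb{R}_+ \cup \{0\} \to [0,1]^3$ be the solution to the ODE

\[ \begin{cases} \dfrac{d}{dt} f_u(t) = F(f_u(t)), & t \ge 0, \\ f_u(0) = u. \end{cases} \tag{2.3} \]

The solution can be calculated explicitly: writing $f_u(t) = (x_u(t), y_u(t), z_u(t))$ and $u = (x_0, y_0, z_0)$, we easily check that

\[ \begin{cases}
x_u(t) = \left(x_0 - \dfrac{(2x_0 + y_0)^2}{4}\right) e^{-t} + \dfrac{(2x_0 + y_0)^2}{4}, \\[2ex]
y_u(t) = -2\left(x_0 - \dfrac{(2x_0 + y_0)^2}{4}\right) e^{-t} - \dfrac{(2x_0 + y_0)^2}{2} + 2x_0 + y_0, \\[2ex]
z_u(t) = 1 + \left(x_0 - \dfrac{(2x_0 + y_0)^2}{4}\right) e^{-t} + \dfrac{(2x_0 + y_0)^2}{4} - 2x_0 - y_0 .
\end{cases} \]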
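The explicit solution can be checked numerically. The Python sketch below, with hypothetical function names, evaluates the drift $F$, the flow $f_u(t)$ and its limit as $t \to \infty$, so that one can verify that the flow starts at $u$, that its limit is a zero of $F$, and that this limit satisfies the Hardy-Weinberg relation.

```python
import math


def drift(x: float, y: float, z: float) -> tuple[float, float, float]:
    """The vector field F, written through the allele frequency a = x + y/2."""
    a = x + y / 2.0
    return a * a - x, 2.0 * a * (1.0 - a) - y, (1.0 - a) ** 2 - z


def flow(u, t: float) -> tuple[float, float, float]:
    """Explicit solution f_u(t) of the ODE started at u = (x0, y0, z0)."""
    x0, y0, _ = u
    a2 = (2.0 * x0 + y0) ** 2 / 4.0
    d = (x0 - a2) * math.exp(-t)
    return a2 + d, -2.0 * d - 2.0 * a2 + 2.0 * x0 + y0, 1.0 + d + a2 - 2.0 * x0 - y0


def limit_point(u) -> tuple[float, float, float]:
    """Limit of f_u(t) as t tends to infinity."""
    x0, y0, _ = u
    a2 = (2.0 * x0 + y0) ** 2 / 4.0
    return a2, 2.0 * x0 + y0 - 2.0 * a2, 1.0 + a2 - 2.0 * x0 - y0


u = (0.2, 0.3, 0.5)
start = flow(u, 0.0)
v = limit_point(u)
```

For this choice of $u$, the allele frequency $x_0 + y_0/2 = 0.35$ is preserved along the flow and the limit is $(0.35^2,\, 2 \cdot 0.35 \cdot 0.65,\, 0.65^2)$.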



We aim to show almost-sure convergence of $\pi^n = (\pi^n_{00}, \pi^n_{01}, \pi^n_{11})$ as $n \to \infty$. The first step in achieving this is to show almost-sure convergence of $v(\pi^n)$ as $n \to \infty$, where $v(u) := \lim_{t \to \infty} f_u(t)$. This is achieved in the following lemma.

Lemma 1. As $n \to \infty$, $v(\pi^n)$ converges almost surely.


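Proof. We shall show that almost surely, $(v(\pi^n))_n$ is a Cauchy sequence. We have

\[ \left| v(\pi^{n+1}) - v(\pi^n) \right| \le \left| v\!\left(\pi^n + \frac{1}{n+3} F(\pi^n)\right) - v(\pi^n) \right| + \left| v(\pi^{n+1}) - v\!\left(\pi^n + \frac{1}{n+3} F(\pi^n)\right) \right| . \tag{2.4} \]

We provide upper bounds on each term appearing on the right-hand side. Firstly, using the fact that $v(x) = v(f_x(t))$ for any $t \ge 0$,

\[ \left| v\!\left(\pi^n + \frac{1}{n+3} F(\pi^n)\right) - v(\pi^n) \right| = \left| v\!\left(\pi^n + \frac{1}{n+3} F(\pi^n)\right) - v\!\left(f_{\pi^n}\!\left(\frac{1}{n+3}\right)\right) \right| . \]

We have the explicit form of $v$ as

\[ v(u) = \left( \frac{(2x_0 + y_0)^2}{4},\; -\frac{(2x_0 + y_0)^2}{2} + 2x_0 + y_0,\; 1 + \frac{(2x_0 + y_0)^2}{4} - 2x_0 - y_0 \right) \]

for any $u = (x_0, y_0, z_0)$. The function $v$ is clearly Lipschitz on $[0,1]^3$ and so there exists a constant $c$ such that

\[ \left| v\!\left(\pi^n + \frac{1}{n+3} F(\pi^n)\right) - v\!\left(f_{\pi^n}\!\left(\frac{1}{n+3}\right)\right) \right| \le c \left| \pi^n + \frac{1}{n+3} F(\pi^n) - f_{\pi^n}\!\left(\frac{1}{n+3}\right) \right| \le O(1/n^2), \]

since $f_{\pi^n}(1/(n+3)) = f_{\pi^n}(0) + \frac{1}{n+3} f'_{\pi^n}(0) + O(1/n^2) = \pi^n + \frac{1}{n+3} F(\pi^n) + O(1/n^2)$. For the second term on the right-hand side of (2.4), we have

\[ \left| v(\pi^{n+1}) - v\!\left(\pi^n + \frac{1}{n+3} F(\pi^n)\right) \right| \le c \left| \pi^{n+1} - \pi^n - \frac{1}{n+3} F(\pi^n) \right| \le \frac{c}{n+3} \left| \eta_n \right| \]

by the definition of $\pi^{n+1}$, see (2.2). However, since $F$ is bounded we deduce that we can upper bound this term by $O(1/n)$. Plugging the two bounds we have obtained into equation (2.4) shows that the sequence $(v(\pi^n))_n$ is indeed Cauchy (surely), and this completes the proof. □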


We are now in a position to show almost-sure convergence of the stochastic process $\pi^n = (\pi^n_{00}, \pi^n_{01}, \pi^n_{11})$, $n \ge 1$.


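Theorem 1. The random vector $\pi^n = (\pi^n_{00}, \pi^n_{01}, \pi^n_{11})$, $n \ge 1$, has the following asymptotic behaviour:

\[ \pi^n \to (\pi_{00}, \pi_{01}, \pi_{11}) \quad \text{a.s., as } n \text{ tends to } +\infty, \]

where $(\pi_{00}, \pi_{01}, \pi_{11})$ is distributed on $S$. In particular, it satisfies the Hardy-Weinberg equilibrium:

\[ \pi_{01} = 2\sqrt{\pi_{00}\pi_{11}} . \]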


Proof. We first claim that almost surely, the $L^1$ distance between $\pi^n$ and $S$ tends to 0 as $n \to \infty$. Recall that the distance $|\pi^n - S|$ is defined as

\[ |\pi^n - S| := \min_{s \in S} |\pi^n - s| := \min_{(x,y,z) \in S} \left\{ |\pi^n_{00} - x| + |\pi^n_{01} - y| + |\pi^n_{11} - z| \right\} . \]

In fact, this is a consequence of Theorem 2.2 in [9], which asserts that the limit set of $(\pi^n)$ (i.e. the set of limits of subsequences of $(\pi^n)$) is almost surely a connected compact internally chain recurrent set for the flow associated to the ODE (2.3). In particular the limit set of $(\pi^n)$ is included in $S$, which implies that the distance between $\pi^n$ and $S$ tends almost surely to 0.

Suppose $x \in S$, so that $F(x) = 0$ by definition. Then $\frac{d}{dt} f_x(t) = 0$ for all $t \ge 0$ and so $f_x(t) = x$ for all $t \ge 0$, and in particular $v(x) = x$. Since $v$ is Lipschitz and $v(S) = S$ we have that, as $x \to S$, $|v(x) - x| \to 0$. But since $v(\pi^n)$ converges almost surely to some limit random variable, we deduce that $\pi^n$ also converges almost surely, and to the same limiting random variable.

Finally, Hardy-Weinberg equilibrium follows readily from the fact that $(\pi_{00}, \pi_{01}, \pi_{11})$ is distributed on the set $S$, i.e. $F(\pi_{00}, \pi_{01}, \pi_{11}) = 0$. □


In this theorem, additional information is provided by the Hardy-Weinberg principle, which gives a relationship between the allelic frequencies and the genotypic frequencies. This equilibrium was predictable and is actually a natural consequence of the absence of any evolutionary forces.


Let us now consider the general case $m > 1$. We denote by $\pi_G$ the frequency of a genotype $G = (G_1, \dots, G_m) \in \{00, 01, 11\}^m$. If $\pi_{\ell,00}$, $\pi_{\ell,01}$ and $\pi_{\ell,11}$ are respectively the limiting gene frequencies of the $\ell$-th gene with alleles 0 and 1, then from the independence between genes (see condition (c) in the previous subsection), the limiting frequency of the genotype $G$ at equilibrium is

\[ \pi_G = \pi_{1,G_1} \pi_{2,G_2} \cdots \pi_{m,G_m} . \]

Remark 1. It is quite a challenging question to determine the exact distribution of the limit triplet $(\pi_{00}, \pi_{01}, \pi_{11})$. Actually, our simulations show that it may have a diffuse distribution in the set $\{(x,y,z) \in [0,1]^3 : x + y + z = 1\}$, which depends on the initial values $\pi^0_{00}$, $\pi^0_{01}$ and $\pi^0_{11}$, see Figure 1.

Figure 1: Empirical distribution functions of $\pi_{00}$ (blue), $\pi_{11}$ (red) and $\pi_{01}$ (black). The first figure is obtained with initial values $\pi^0_{00} = 1$, $\pi^0_{01} = 2$, $\pi^0_{11} = 3$ and the second one is obtained with $\pi^0_{00} = 1$, $\pi^0_{01} = 1$, $\pi^0_{11} = 0$.








Remark 2. A subsequent question to Theorem 1 concerns the speed of convergence of $(\pi^n_{00}, \pi^n_{01}, \pi^n_{11})$. Some results in this direction are given in [?] and [?]. However, they require some strong assumptions on the derivative of the function $F$ at the limiting point $(\pi_{00}, \pi_{01}, \pi_{11})$, which are quite difficult to verify in our situation, mainly due to the fact that we do not know the distribution of $(\pi_{00}, \pi_{01}, \pi_{11})$. However, it is reasonable to expect that a central limit type theorem holds, in which case the speed of convergence of $(\pi^n_{00}, \pi^n_{01}, \pi^n_{11})$ to $(\pi_{00}, \pi_{01}, \pi_{11})$ would be of order $\sqrt{n}$.


3 Application of the model 


Our model was tested on a population of 85 living accessions representing the common broom Cytisus scoparius and three related interspecific hybrids. This dataset consists of 62 vegetatively propagated cultivars obtained from various nurseries. These cultivars belong to either Cytisus scoparius, Cytisus x dallimorei (hybrid between C. scoparius and C. multiflorus), C. x praecox (hybrid between C. multiflorus and C. oromediterraneus), or C. x boskoopii (hybrid between C. x dallimorei and C. x praecox). In addition, three to nine individuals obtained from each of five wild populations have been included (3 individuals of Cytisus oromediterraneus from France, 3 individuals of Cytisus scoparius from Italy, 3 from Poland, 4 from Angers, France and 9 from Ernee, France). For all these samples, DNA extraction used the Nucleospin(R) Plant II kit from Macherey-Nagel. ISSR data were obtained using six sets of primers, namely ISSR5 (sequence: 5'-CACACACACACACACARC-3'), ISSR7 (sequence: 5'-CACACACACACACACART-3'), ISSR13 (sequence: 5'-GTGTGTGTGTGTGTGTYA-3'), ISSR890 (sequence: 5'-VHVGTGTGTGTGTGTGT-3'), ISSR891 (sequence: 5'-HVHTGTGTGTGTGTGTG-3') and ISSRa (sequence: 5'-GGTGTGTGTGTGTGTG-3'). Polymerase chain reaction (PCR) was done using the following parameters: 95°C for 2 min, then 39 cycles of 95°C for 30 s, 50°C for 30 s and 72°C for 120 s, followed by 10 min of extension at 72°C. Electrophoresis was done on a 5% acrylamide-bisacrylamide gel (mixing ratio 29:1) with 7 M urea, with a pre-run of 30 min at 80 W, then 2 h 30 min at 60 W. Staining used silver nitrate. Gels were scanned and bands were read manually.


Using data obtained from ISSR analysis, our present aim is to determine the most likely pedigree relating these individuals. A program in the R language has been written according to the model described in the previous sections. Applied to our data, it provided the pedigrees presented in Figures 2, 3 and 4 below. The use of this method first requires that the population we are dealing with satisfies principles (a)-(e) in Subsection 2.1, and the parameters $\varepsilon$, $\delta$, $p_\ell$ and $q_\ell$ must be inferred from our data.


Breedings have occurred over time under the action of professional breeders or through natural phenomena, and in the absence of further information, assumption (a) about the uniform prior distribution is reasonable. According to botanists, this is also the case for assumption (b), which means that there are no missing individuals in the population. Then we need to ensure the independence hypothesis (c) between the bands $\{x_\ell(g) = 1\}$, $\ell \in \{1, \dots, m\}$. Dependence may occur due to the selective sweep phenomenon, which can associate together several genes whose loci are close to each other along the chromosome. For such sets of genes, recombination is not strong enough for them to be considered as independent in the reproduction process. Among the 424 bands, we have therefore selected 168 which a statistical test shows to be independent.

We also need to determine the values of $\varepsilon$, $\delta$, $p_\ell$ and $q_\ell$ related to the present data, in order to construct the probability measure which is defined in (d). First recall that in the ISSR amplification, six markers allow us to test the presence or absence of those 168 bands, each marker corresponding to a particular set of bands (34 bands for ISSR890, 22 for ISSR891, 31 for ISSRa, 32 for ISSR5, 27 for ISSR7 and 22 for ISSR13). For each of the six markers used, in order to apply the above model, we need to estimate the values of $\delta$ and $\varepsilon$ (the error probabilities, which can occur during the experiment). We achieve this by repeatedly crossing two individuals (G017 Cytisus scoparius 'Lunagold' and G010 Cytisus x dallimorei 'Burkwoodii') and performing marker analysis (using 5 of the 6 markers used for the dataset) on the resulting offspring (n = 33 plants). We are then able to estimate, for each marker, the value of $\delta$. Denoting by $\delta_m$ the error using marker $m$, we assume that $\delta_m$ is a Gaussian random variable such that $\mathrm{Var}(\delta_m) = \mathrm{Var}(\delta_{m'})$ for all markers $m$, $m'$. We obtained the following average errors:

\[ \mathbb{E}(\delta_{ISSRa}) = 0.16, \quad \mathbb{E}(\delta_{ISSR890}) = 0.16, \quad \mathbb{E}(\delta_{ISSR891}) = 0.14, \quad \mathbb{E}(\delta_{ISSR5}) = 0.19, \quad \mathbb{E}(\delta_{ISSR7}) = 0.1 . \]

For each pair of markers $m$ and $m'$, we ran a hypothesis test to determine whether $\mathbb{E}(\delta_m) = \mathbb{E}(\delta_{m'})$ and we found that we do not reject this null hypothesis at a 95% confidence level. We obtained a 95% confidence interval of (0.126, 0.195) for the error, under the assumption that the errors from the different markers all came from the same distribution. For the present reconstructions we have chosen the value $\delta = 0.15$. The same study for the error $\varepsilon$ leads us to the choice of $\varepsilon = 0.05$.


In Subsection 2.2 we proved convergence of gene frequencies, and we will assume that the population which is considered here has attained some equilibrium, that is principle (e).

As can be seen from equation (2.1), thanks to the Hardy-Weinberg principle, the probabilities $p_\ell$ and $q_\ell$ only depend on the probability $\pi_{00}$. We emphasize that the latter probability is actually the only one whose empirical value can be determined from the data. Indeed it is not possible to distinguish the genotype 01 from the genotype 11 in ISSR data. In the present case, we obtain the values of $\pi_{00}$ and hence $p_\ell$ and $q_\ell$ for each band.
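A minimal Python sketch of this estimation step follows. It assumes, as an illustration only, that $\pi_{00}$ is estimated for each band by the proportion of individuals in which the band is absent; the function name `estimate_band_parameters` and the toy data are hypothetical and do not come from the dataset.

```python
import math


def estimate_band_parameters(bands):
    """For each band, estimate pi00 and deduce (p, q) from equation (2.1).

    bands: list of 0/1 vectors, one per individual, all of the same length.
    """
    n = len(bands)
    results = []
    for band in range(len(bands[0])):
        absent = sum(1 for row in bands if row[band] == 0)
        pi00 = absent / n  # empirical frequency of genotype 00
        s = math.sqrt(pi00)
        results.append({
            "pi00": pi00,
            "p": (1.0 + 2.0 * s) / (1.0 + s) ** 2,
            "q": 1.0 / (1.0 + s),
        })
    return results


# Toy data: four individuals, two bands.
toy_bands = [[1, 0], [1, 0], [0, 0], [1, 1]]
params = estimate_band_parameters(toy_bands)
```

A band that is rarely absent (small $\pi_{00}$) gives $p_\ell$ and $q_\ell$ close to 1, whereas a band that is often absent gives values close to the lower bounds 3/4 and 1/2.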


The probabilities $\mu_i(j,k)$ defined at the end of Subsection 2.1 may appear quite low once computed from our dataset. However, knowing that all individuals belong to the same family, we are only concerned with their relative values. The pedigrees appearing in Figures 2, 3 and 4 were obtained with the threshold probabilities 0.1, 0.2 and 0.3 respectively. Founders are represented in black, and individuals with neither parents nor children are not represented. As expected, when the threshold probability $p$ increases, the number of relations between individuals decreases and more individuals are considered as founders. Compared to the existing knowledge we have on the group (see [?]), several relationships are congruent with historical information. For example, 'Zeelandia' is reported as a descendant of 'Burkwoodii' and a C. x praecox. This relationship appears with all threshold probabilities. 'Liza', 'Andreanus Select' and 'Donard Seedling' are all historically reported as sports (bud mutations) of 'Burkwoodii', while 'Lena' is supposed to be a seedling of it. They are all linked under p = 0.1 and p = 0.2, while under the higher threshold probability 'Burkwoodii', 'Liza' and 'Andreanus Select' are still linked; however, 'Donard Seedling' is treated as a seedling of 'Burkwoodii' and Cytisus ardoinoi, which may be impossible (the sample used to represent this last species being wild collected). 'Firefly' is reported as a seedling of 'Andreanus', which appears under all threshold probabilities. In contrast with historical information, 'La Coquette' appears here as a founder and as a parent of 'Roter Favorit', while it was reported as a self-fertilization of 'Hollandia' and a half-sibling of 'Boskoop Ruby'. 'Hollandia' is known to be a seedling from 'Burkwoodii' and C. x praecox; here, under p = 0.1, it is a seedling of the same 'Burkwoodii' but with C. scoparius. Using the same ISSR data, Auvray in [1] points out the putative link between 'Apricot Gem' and 'Dukaat', as well as between 'Boskoop Ruby' and 'Windlesham'. These links are reinforced here and second putative parents are provided (kewensis for 'Apricot Gem' and 'Hollandia' for 'Windlesham'). Auvray [1] also points out a parentage between 'Moclard Pink' and 'Minstead' (the former being a putative seedling of the latter); here 'Moclard Pink' is always linked with 'Albus', a point which needs consideration. Under the various threshold probabilities, 'Luna', 'Palette' and 'Roter Favorit' are linked; this seems reasonably consistent with the fact that they were all obtained from the same nursery (Arnold, at Alreslohe near Holstein in Germany) around 1960. 'Jessica', linked to the same group under p = 0.1, is of unknown parentage, while 'Goldfinch', also linked under p = 0.1, is reported to be a seedling between 'Donard Seedling' and 'Dorothy Walpole' (lacking from the sampling). The links between 'Andreanus', 'Firefly', 'Golden Sunlight', 'Andreanus Splendens', 'Golden Cascade', 'Roter Favorit' and 'Queen Mary', appearing under all threshold probabilities, recall that all these cultivars are selections of C. scoparius and not of any of the interspecific hybrids.


4 Discussion 


We have set up a mathematical model of pedigree reconstruction whose basic principle is to determine, for each individual, the most likely parent pair in the population, according to the probability distribution which is defined in (d) of Subsection 2.1. The robustness of this model mainly relies on the fact that gene frequencies have attained some equilibrium. We show in Subsection 2.2 that indeed, in the absence of any evolutionary forces, gene frequencies converge toward a limit random vector which satisfies Hardy-Weinberg equilibrium. From this model we derived an algorithm, written in the R language, and we applied it to ISSR data from a population of diploid plants. The results reveal that the pedigrees obtained from this method fit the partial reconstructions based on botanical data or on other methods using dendrograms obtained from distance matrices. This additional source of information could also be used to improve the model by constructing a new probability distribution giving a relative weight to each kind of data.

Figure 2: Threshold probability p = 0.1.

Figure 3: Threshold probability p = 0.2.

Figure 4: Threshold probability p = 0.3.

Greater power could also be given to our method by dropping assumption (b) on the absence of missing individuals. Indeed, missing individuals who would actually have many family relationships in the population could considerably distort the real pedigree. An improvement would then consist in determining how much the addition of one or several virtual individuals with specific genomes increases the likelihood of the pedigree.

Principle (c) assumes that recombination is uniform, but this can be made more realistic by determining, from a preliminary statistical inference, how different sets of loci actually recombine. Then the model can easily be adapted.

Finally we emphasize that our model can be applied to phenotyped data. Indeed, as already observed above, the knowledge of ISSR data is equivalent to the knowledge of the expression of a dominant gene. Hence our model can easily be tested on a population for which we observe a specific set of phenotypical criteria and whose family relationships are known a priori.